Evaluating classifier performance with highly imbalanced Big Data
Authors
Abstract
Using the wrong metrics to gauge the classification of highly imbalanced Big Data may hide important information in experimental results. However, we find that analysis of metrics for performance evaluation, and of what they can hide or reveal, is rarely covered in related works. Therefore, we address that gap by analyzing multiple popular performance metrics on three classification tasks. To the best of our knowledge, we are the first to utilize new Medicare insurance claims datasets which became publicly available in 2021. These datasets are all highly imbalanced, and are comprised of completely different data. We evaluate the performance of five ensemble learners in the Machine Learning task of fraud detection. Random Undersampling (RUS) is applied to induce different class ratios. The classifiers are evaluated with both the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision-Recall Curve (AUPRC) metrics. We show that AUPRC provides better insight into classification performance. Our findings reveal that the AUC metric hides the performance impact of RUS, whereas results in terms of AUPRC show that RUS has a detrimental effect. We find that, for highly imbalanced Big Data, the AUC metric fails to capture information about the precision scores and false positive counts that AUPRC reveals. Our contribution is to show that AUPRC is a more effective metric for evaluating performance when working with highly imbalanced Big Data.
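The paper's Medicare datasets and ensemble learners are not reproduced here, but the metric comparison it describes can be illustrated with a minimal sketch. The setup below is hypothetical (synthetic data with roughly 1% positives, a single random-forest learner, a 1:1 RUS ratio), showing how AUC and AUPRC are computed side by side with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced dataset: roughly 1% positive class.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random Undersampling (RUS): keep all positives and sample an equal
# number of negatives from the training set (a 1:1 class ratio).
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr[idx], y_tr[idx])
scores = clf.predict_proba(X_te)[:, 1]

# AUC summarizes the ROC curve; AUPRC (average precision) summarizes
# the precision-recall curve, which is sensitive to false positives
# on the majority class.
auc = roc_auc_score(y_te, scores)
auprc = average_precision_score(y_te, scores)
print(f"AUC:   {auc:.3f}")
print(f"AUPRC: {auprc:.3f}")
```

On imbalanced data the two numbers typically diverge: AUC can remain high even when the classifier produces many false positives relative to the few true positives, while AUPRC drops, which is the effect the abstract attributes to RUS.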
Similar resources
Mining Imbalanced Data with Learning Classifier Systems
This chapter investigates the capabilities of XCS for mining imbalanced datasets. Initial experiments show that, for moderate and high class imbalances, XCS tends to evolve a large proportion of overgeneral classifiers. Theoretical analyses are developed, deriving an imbalance bound up to which XCS should be able to differentiate between accurate and overgeneral classifiers. Some relevant param...
Evaluating Misclassifications in Imbalanced Data
Evaluating classifier performance with ROC curves is popular in the machine learning community. To date, the only method to assess confidence of ROC curves is to construct ROC bands. In the case of severe class imbalance with few instances of the minority class, ROC bands become unreliable. We propose a generic framework for classifier evaluation to identify a segment of an ROC curve in which m...
Two Stage Comparison of Classifier Performances for Highly Imbalanced Datasets
During the process of knowledge discovery in data, imbalanced learning data often emerges and presents a significant challenge for data mining methods. In this paper, we investigate the influence of class imbalanced data on the classification results of artificial intelligence methods, i.e. neural networks and support vector machine, and on the classification results of classical classification...
Evaluating Difficulty of Multi-class Imbalanced Data
Multi-class imbalanced classification is more difficult than its binary counterpart. Besides typical data difficulty factors, one should also consider the complexity of relations among classes. This paper introduces a new method for examining the characteristics of multi-class data. It is based on analyzing the neighbourhood of the minority class examples and on additional information about sim...
Complexity Curve: a Graphical Measure of Data Complexity and Classifier Performance Supplementary document S2: Evaluating Classifier Performance with Generalisation Curves
We discussed the role of data complexity measures in the evaluation of classification algorithms performance. Knowing characteristics of benchmark data sets it is possible to check which algorithms perform well in the context of scarce data. To fully utilise this information, we present a graphical performance measure called generalisation curve. It is based on learning curve concept and allows...
Journal
Journal title: Journal of Big Data
Year: 2023
ISSN: 2196-1115
DOI: https://doi.org/10.1186/s40537-023-00724-5